feat: Phase 3 Advanced Features - structure extraction, OCR, caching, tracking #9

Merged: krisoye13 merged 1 commit into main from feature/phase3-features on Feb 2, 2026

@krisoye (Owner) commented Feb 2, 2026

Summary

Implements Phase 3 of Epic #21 (Document Analysis MCP Server) with advanced features:

  • pdf_ocr tool - OCR for scanned PDFs using Tesseract
  • pdf_extract_structure tool - Extract TOC, tables, and section headings
  • Document caching - Hash-based deduplication with configurable TTL
  • Usage tracking - Token logging and cost estimation per operation
  • Utility tools - cache_stats and usage_summary for monitoring

Changes

New Tools

| Tool | Description |
| --- | --- |
| pdf_ocr | OCR for image-based PDFs with Tesseract |
| pdf_extract_structure | Extract document structure (TOC, tables, headings) |
| cache_stats | View cache statistics |
| usage_summary | View API usage and costs |

New Modules

  • src/document_analysis_mcp/cache/__init__.py - Hash-based caching
  • src/document_analysis_mcp/tracking/__init__.py - Usage tracking
  • src/document_analysis_mcp/tools/ocr.py - OCR tool
  • src/document_analysis_mcp/tools/structure.py - Structure extraction

Updated Files

  • server.py - Register new tools, v0.3.0
  • pyproject.toml - Version bump

Tests

  • 210 tests passing
  • Full coverage for new modules
  • Tests for cache, tracking, OCR, and structure extraction

Test Plan

  • All 210 tests passing
  • Ruff linting passes
  • Ruff format check passes
  • Manual testing with sample PDFs
  • Integration testing with Tesseract OCR

Closes krisoye/project-tracker#96

Generated with Claude Code

… tracking

Implements Phase 3 of Epic #21 (Document Analysis MCP Server):

## New Tools
- **pdf_ocr** - OCR for scanned PDFs using Tesseract
  - Automatic fallback when text extraction fails
  - Configurable language and DPI
  - Force OCR option for guaranteed image-to-text

- **pdf_extract_structure** - Document structure extraction
  - Table of Contents (TOC) detection
  - Table extraction with markdown formatting
  - Section/heading hierarchy detection
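
The automatic fallback and force-OCR options listed for pdf_ocr above could be wired together roughly as follows. This is only a sketch, not the code in tools/ocr.py: the function name, the `min_chars` threshold, and the two injected callables are assumptions made for illustration, and the PDF text extractor and the Tesseract call are passed in rather than invoked directly.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical threshold: below this many characters the page is treated
# as image-only and OCR is attempted instead.
MIN_TEXT_CHARS = 50


def extract_with_ocr_fallback(
    extract_text: Callable[[], str],
    run_ocr: Callable[[], str],
    force_ocr: bool = False,
    min_chars: int = MIN_TEXT_CHARS,
) -> tuple[str, str]:
    """Return (text, method), where method is "text" or "ocr"."""
    if not force_ocr:
        try:
            text = extract_text()
        except Exception:
            # Treat an extraction failure like an empty result.
            text = ""
        if len(text.strip()) >= min_chars:
            return text, "text"
    # Either OCR was forced or plain extraction produced too little text.
    return run_ocr(), "ocr"
```

Passing the extractor and the OCR routine as callables keeps the decision logic testable without a Tesseract installation.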

## New Modules
- **cache/** - Hash-based document caching
  - SHA-256 content hashing for deduplication
  - Configurable TTL via CACHE_TTL_DAYS env
  - Automatic cleanup of expired entries
  - Parameter-aware cache keys for different tool configs

- **tracking/** - API usage tracking
  - Token usage logging per operation
  - Cost estimation by model
  - Daily summary reports
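
One plausible shape for the parameter-aware cache key and the TTL check described for cache/ is sketched below. Only SHA-256 hashing and the CACHE_TTL_DAYS variable come from this PR; the function names, the 30-day default, and the key layout are assumptions.

```python
from __future__ import annotations

import hashlib
import json
import os
import time


def cache_key(content: bytes, tool: str, params: dict) -> str:
    """Hash the document bytes together with the tool name and its parameters."""
    digest = hashlib.sha256()
    digest.update(content)
    digest.update(b"\0")
    digest.update(tool.encode("utf-8"))
    digest.update(b"\0")
    # sort_keys makes the key independent of parameter ordering.
    digest.update(json.dumps(params, sort_keys=True, separators=(",", ":")).encode("utf-8"))
    return digest.hexdigest()


def ttl_seconds(default_days: float = 30.0) -> float:
    """Read the TTL from CACHE_TTL_DAYS; the 30-day default is hypothetical."""
    return float(os.environ.get("CACHE_TTL_DAYS", default_days)) * 86400


def is_expired(created_at: float, now: float | None = None) -> bool:
    """True when an entry created at `created_at` (epoch seconds) is past its TTL."""
    current = time.time() if now is None else now
    return (current - created_at) > ttl_seconds()
```

Because the parameters are part of the hash, the same PDF processed with different settings (for example a different DPI) lands in separate cache entries, while identical content and settings deduplicate to one.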

## Additional Tools
- **cache_stats** - View cache statistics
- **usage_summary** - View API usage and costs
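
The per-model cost estimate and daily summary behind usage_summary might look something like the sketch below. The record fields, the model name, and the prices are placeholders invented for illustration; they are not taken from the tracking module.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens: (input, output).
PRICES = {"example-model": (3.00, 15.00)}


@dataclass
class UsageRecord:
    date: str          # ISO date, e.g. "2026-02-02"
    operation: str     # tool name that triggered the API call
    model: str
    input_tokens: int
    output_tokens: int


def estimate_cost(record: UsageRecord) -> float:
    """Estimate cost in USD; unknown models are priced at zero."""
    in_price, out_price = PRICES.get(record.model, (0.0, 0.0))
    return (record.input_tokens * in_price + record.output_tokens * out_price) / 1_000_000


def daily_summary(records: list[UsageRecord]) -> dict[str, dict[str, float]]:
    """Aggregate token counts and estimated cost per day."""
    summary: dict[str, dict[str, float]] = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
    )
    for record in records:
        day = summary[record.date]
        day["input_tokens"] += record.input_tokens
        day["output_tokens"] += record.output_tokens
        day["cost_usd"] += estimate_cost(record)
    return dict(summary)
```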

## Version
- Bumped to v0.3.0

Closes krisoye/project-tracker#96

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@krisoye13 krisoye13 merged commit 2de9e22 into main Feb 2, 2026
4 checks passed
krisoye pushed a commit that referenced this pull request Feb 2, 2026
- Add file locking (fcntl) to cache metadata operations for concurrent access
- Add threading.Lock for in-memory cache metadata protection
- Add file locking to usage tracking append and read operations
- Add language parameter validation for OCR tool with VALID_LANGUAGES set
- Add atomic metadata writes using temp file + rename pattern
- Add comprehensive concurrent operation tests for cache and tracking

Thread safety improvements:
- Cache: _save_metadata() uses exclusive lock with atomic write
- Cache: _load_metadata() uses shared lock for concurrent reads
- Cache: All metadata modifications protected by threading.Lock
- Tracking: record() uses exclusive lock for append operations
- Tracking: get_records() uses shared lock for read operations
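
A minimal sketch of the locking scheme listed above, assuming a Unix host (fcntl is not available on Windows). The function names mirror `_save_metadata()` and `_load_metadata()` but the bodies are illustrative, and the sidecar `.lock` file is an assumption of this sketch rather than something the commit confirms.

```python
import fcntl
import json
import os
import tempfile
import threading
from pathlib import Path

# Guards in-process access; flock guards access across processes.
_metadata_lock = threading.Lock()


def _lock_path(path: Path) -> Path:
    # Assumed sidecar lock file, so the lock survives the rename below.
    return path.with_name(path.name + ".lock")


def save_metadata(path: Path, metadata: dict) -> None:
    """Write metadata under an exclusive lock using temp file + rename."""
    with _metadata_lock, open(_lock_path(path), "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
            with os.fdopen(fd, "w") as tmp:
                json.dump(metadata, tmp)
                tmp.flush()
                os.fsync(tmp.fileno())
            # os.replace is atomic when source and target share a filesystem.
            os.replace(tmp_name, path)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def load_metadata(path: Path) -> dict:
    """Read metadata under a shared lock so concurrent readers do not block."""
    with open(_lock_path(path), "a") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_SH)
        try:
            if not path.exists():
                return {}
            return json.loads(path.read_text())
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Locking a separate file rather than the metadata file itself matters here: the rename swaps in a new inode, so a lock held on the old file would not protect later writers.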

Input validation:
- OCR: Invalid language codes log warning and fall back to "eng"
- OCR: VALID_LANGUAGES includes 28 common Tesseract language codes
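
The language check could be as small as the sketch below. The set shown is an illustrative five-code subset, not the 28 codes in the commit, and `validate_language` is a hypothetical name.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative subset only; the commit's VALID_LANGUAGES holds 28 codes.
VALID_LANGUAGES = frozenset({"eng", "fra", "deu", "spa", "ita"})
DEFAULT_LANGUAGE = "eng"


def validate_language(language: str) -> str:
    """Return the language if it is known, otherwise warn and use "eng"."""
    if language in VALID_LANGUAGES:
        return language
    logger.warning(
        "Invalid OCR language %r; falling back to %r", language, DEFAULT_LANGUAGE
    )
    return DEFAULT_LANGUAGE
```

Checking against a fixed set keeps arbitrary user input out of the Tesseract invocation. This sketch does not handle combined codes such as "eng+fra".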

Fixes issues identified in QA review of PR #9

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
krisoye13 added a commit that referenced this pull request Feb 2, 2026

Co-authored-by: Krisoye Smith <krisoye@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
@krisoye krisoye deleted the feature/phase3-features branch February 6, 2026 00:00